The universal scalability law (USL) is an analytic model used to quantify application
scaling. It is universal because it subsumes Amdahl's law and Gustafson linearized 
scaling as special cases. Using simulation, we show: (i) that the USL is equivalent 
to synchronous queueing in a load-dependent machine repairman model and (ii) 
how USL, Amdahl's law and Gustafson scaling can be regarded as boundaries 
defining three scalability zones. Typical throughput measurements lie across all 
three zones. Simulation scenarios provide deeper insight into queueing effects and 
thus provide a clearer indication of which application features should be tuned to 
get into the optimal performance zone. 



1 INTRODUCTION



The 2008 JavaOne conference included many presenta- 
tions on techniques for achieving better scalability, e.g., 
caching, collocation, "parallelization," and pooling. But 
these are only qualitative descriptions. How can the im- 
pact of applying such techniques be quantified? Clearly, 
this is the domain of performance modeling. Perfor- 
mance models are essential, not only for prediction, but 
also for interpreting scalability measurements. However,
most performance modeling tools, e.g., PDQ [Gun05]
and SIMUL8 [Hol04b], use a queueing paradigm which 
requires measured service times as inputs to parameter- 
ize each queueing facility being modeled. More often 
than not, such detailed measurements are not available, 
thereby thwarting this approach. 

A more practical approach is based on statistical re- 
gression [Hol04a] of measured throughput data using 
parametric models; the advantage being that service- 
time measurements are not required. One paramet- 
ric model that has been used successfully for the past 
two decades is based on the universal scalability law or 
USL [Gun93, Gun07, Gun08]. A distinguishing feature
of the USL is its ability to analytically model the
retrograde throughput (Fig. 1) commonly seen in custom
benchmarks and load-test measurements. If such
retrograde behavior is not present, the USL reduces to either
Amdahl's law (see Sect. 2.2) or Gustafson's linearized
form (see Sect. 2.3).

Figure 1: USL parametric model (red) fitted to WebSphere
relative capacity measurements C(N) as a function
of user load N with coefficients α = 0.18169 and
β = 0.00047. Retrograde performance is clear. Amdahl
and Gustafson parametric models cannot accommodate
this effect.



As useful as all this has been, the question has remained: 
Do parametric models like USL represent something more 
fundamental? We answer that question in the affirmative
by showing that the USL corresponds to a certain
bounding curve on the throughput of a machine repair- 
man (MRM) queueing model [GH98]. We use event- 
based simulation as an exploratory tool to investigate the 
precise conditions under which such a bound can exist. 
Amdahl's law and Gustafson's linearization are contained 
as special cases of the MRM model. Based on this new 
insight, we introduce the concept of scalability zones. 
Rather than relying on any particular bounding curve to 
express scalability, we show how each of these bound- 
ing curves defines a set of zones. In practice, typical 
throughput measurements lie across all these zones and 
this provides a more helpful interpretation for determin- 
ing potential performance improvements. 

Our paper is organized as follows. In Sect. 2 we review
each of the parametric scalability models discussed
subsequently. In Sect. 3 we review the repairman model and
the generalizations necessary to make contact with the
parametric models. These extensions include: (i) a prepping
repairman in Sect. 3.2 and (ii) synchronous queueing
in Sect. 3.3. Sect. 4 describes the simulation models
and presents the results that support the identification
of the USL with synchronous queueing at a prepping
repairman, together with several special cases. Finally, in
Sect. 5 we discuss a new approach to interpreting
scalability data in terms of zones and transitions between
them. These zones are well-defined in terms of queueing 
effects and thus, can provide vital insights into how best 
to improve scalability. An example of applying the zone 
concept to actual scalability measurements is presented. 



2 PARAMETRIC MODELS 

We begin our analysis by defining the relative capacity:

$$C(N) = \frac{X(N)}{X(1)} \tag{1}$$

where X(N) represents the throughput generated by either
N processors in the case of hardware scalability [see
Gun07, Chap. 4] or N virtual users in the case of software
scalability [see Gun07, Chap. 6]. The ratio in (1)
has two possible interpretations:

Data representation: X(N) represents the actual
throughput measurements, e.g., transactions per
second. The relative capacity C(N) is simply the
normalization of those data. See Fig. 1.

Analytic representation: X(N) is represented by a
function, e.g., a linear regression model such as
X(N) = mN + c, where m is the slope and c is the
intercept. See Sect. 4.4.



With regard to each of the parametric models discussed
in this section, we shall demonstrate that C(N) is best
represented, not by a simple linear function, but by a
ratio of functions:

$$C(N) = \frac{P(N)}{Q(N)} \tag{2}$$

where P(N) and Q(N) are polynomials in N. Such functions
are called rational functions (see
en.wikipedia.org/wiki/Rational_function).

Figure 2: Parametric models: USL (red), Amdahl
(green), Gustafson (blue), with parameter values exaggerated
to distinguish their typical characteristics relative
to ideal linearity (dashed black). The horizontal dashed
green line is the Amdahl asymptote at 1/α. Compare
with the application of the USL model in Fig. 1.

We pause to reflect on the significance of (2). Computer
system scalability can be modeled using many possible
functions which are not rational functions, e.g., geometric
scaling [Gun96]. As we shall show in Sects. 3 and 4,
rational functions are physically correct because they
possess a deep connection with queueing theory. As far
as we are aware, this fundamental relationship between
parametric models and queueing models has not been
elucidated before. Conversely, geometric scaling can be
excluded on the grounds that it is unphysical from the
standpoint of queueing theory [Gun07, Gun02].



2.1 Universal Scalability Law 

The most general parametric model of scalability is the
two-parameter universal scalability law (USL):

$$C(N, \alpha, \beta) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)} \tag{3}$$

which is a rational function with P(N) = N and
Q(N) = aN^2 + bN + c, a quadratic polynomial with
coefficients a, b, c > 0. These coefficients have been
regrouped into three terms involving only two parameters
α, β ≥ 0 in the denominator of (3). These three terms
can be interpreted as the "Three C's":

Concurrency-limited scalability when α, β = 0
such that C(N) = N, i.e., linear scaling.

Contention-limited scalability due to serialization
or queueing, i.e., when α > 0, β = 0.

Coherency-limited scalability due to the delay
incurred by making local copies of data or instructions
consistent across multiple caches or nodes, i.e.,
when α, β > 0.

Table 1 summarizes how these parameter values can be
used to classify the scalability of different types of
platforms and applications.

Table 1: Application classes for the USL model

A  Ideal concurrency (α, β = 0)
     Single-threaded tasks
     Parallel text search
     Read-only queries

B  Contention-limited (α > 0, β = 0)
     Tasks requiring locking or sequencing
     Message-passing protocols
     Polling protocols (e.g., hypervisors)

C  Coherency-limited (α = 0, β > 0)
     SMP cache pinging
     Incoherent application state between cluster nodes

D  Worst case (α, β > 0)
     Tasks acting on shared-writable data
     Online reservation systems
     Updating database records
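
The USL and its application classes can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names `usl`, `usl_class` and `usl_peak` are assumptions, the class names are those of Table 1, and the coefficients are the ones quoted in the caption of Fig. 1. The peak location N* = sqrt((1 - α)/β) is not stated in the text; it follows from setting the derivative of the USL with respect to N to zero.

```python
import math

def usl(n, alpha, beta):
    """Relative capacity C(N) of the universal scalability law."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

def usl_class(alpha, beta):
    """Name the Table 1 application class for a parameter pair."""
    if alpha == 0 and beta == 0:
        return "ideal concurrency"
    if beta == 0:
        return "contention-limited"
    if alpha == 0:
        return "coherency-limited"
    return "worst case"

def usl_peak(alpha, beta):
    """Load N* at which C(N) reaches its maximum (requires beta > 0)."""
    return math.sqrt((1.0 - alpha) / beta)

# Coefficients quoted for the WebSphere fit in Fig. 1.
alpha, beta = 0.18169, 0.00047
n_star = usl_peak(alpha, beta)
print(usl_class(alpha, beta), "peak near N =", round(n_star, 1))
for n in (1, 20, 40, 80, 160):
    print(n, round(usl(n, alpha, beta), 2))
```

Beyond N*, the computed capacity falls as N grows, which is the retrograde behavior that a non-zero β introduces.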



2.2 Amdahl's Law 



Amdahl's law [Amd67] corresponds to the special case
of the USL equation (3) with β = 0. Typically, it is used
to quantify the achievable speedup:

$$C_A(N, \alpha) = \frac{N}{1 + \alpha (N - 1)} \tag{4}$$

for fine-grained parallelism. Equation (4) is a rational
function with P(N) = N and Q(N) = bN + c, a linear
polynomial with coefficient values b, c > 0.

Amdahl's law assumes that a single workload comprises
a parallel portion and a remaining serial portion. The
serial portion or serial fraction, 0 < α < 1, is the aggregate
fraction of the workload that can only be executed
sequentially on a single processor, i.e., the non-parallel
portion.

Another fundamental assumption is that the parallel portion
of the workload can be partitioned into N equal
sub-tasks. If the size of these sub-tasks can be made
progressively smaller, then the elapsed execution time
will become dominated by the serial fraction such that
C(N) → 1/α as N → ∞. In other words, there is an
asymptotic ceiling on achievable speedup, shown as the
horizontal line in Fig. 2.



2.3 Gustafson's Linearization 

Amdahl's law assumes the size of the work is fixed.
Gustafson's modification [Gus88] is based on the idea of
scaling up the size of the work to match the processors
[see Gun07, Chap. 4]. This rescaling of the workload
results in the theoretical recovery of linear speedup

$$C_G(N, \alpha) = (1 - \alpha) N + \alpha \tag{5}$$

which is a rational function with P(N) = bN + c, a linear
polynomial with coefficients b, c > 0, and Q(N) = 1,
trivially.

Unlike the USL, (5) exhibits the peculiarity that
C_G(0, α) = α, i.e., there is non-zero capacity even if
there are no processors in the system! This artifact is
shown as the blue line near the origin in Fig. 2 and must
be regarded as an unphysical side-effect of the Gustafson
linearization of Amdahl's law.

Although C_G(N, α) has inspired various efforts for
improving parallel processing efficiencies, achieving truly
linear speedup has turned out to be extremely difficult
in practice. Most recently, (5) has been proposed as a
way to "break Amdahl" scalability for threaded applications
running on multicore processors [Sut08]. Whether
this approach will be effective remains to be seen. There
is a significant literature of failed proposals for defeating
Amdahl's law [see, e.g., KH92, Nel96, Pre95]. See
Sect. 5 for some additional remarks based on the results
of this paper.



2.4 Retrograde Scalability 

The key difference between the USL and the other parametric
models lies in the fact that only the USL can successfully
predict retrograde scaling. See Fig. 2. In other
words, if we think of Gustafson's linear scalability as
corresponding to "equal bang for the buck," and Amdahl's
law as representing "diminishing returns," then the USL
represents "negative return on investment," or negative
ROI.

Such negative ROI effects in application scalability are
not the exception but the norm. Figure 1 shows an example
of WebSphere benchmark data fitted using Mathematica.
The retrograde effect is manifest. It is in this
sense that the USL is considered to be universal.
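
The regression step can be illustrated with a small sketch. It does not reproduce the Mathematica fit mentioned above. Instead, it uses a linearized least-squares estimate based on the rearrangement N/C(N) - 1 = α(N - 1) + βN(N - 1), which is linear in α and β. The function names and the load points are assumptions, and the "measurements" are synthetic values generated from the USL itself with the Fig. 1 coefficients, not benchmark data.

```python
def usl(n, alpha, beta):
    """Relative capacity C(N) of the universal scalability law."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

def fit_usl(loads, capacities):
    """Least-squares estimate of (alpha, beta) from the linearized form
    N/C(N) - 1 = alpha*(N - 1) + beta*N*(N - 1), with no intercept."""
    suu = suv = svv = suy = svy = 0.0
    for n, c in zip(loads, capacities):
        u = n - 1                # regressor multiplying alpha
        v = n * (n - 1)          # regressor multiplying beta
        y = n / c - 1.0          # transformed response
        suu += u * u
        suv += u * v
        svv += v * v
        suy += u * y
        svy += v * y
    # Solve the 2x2 normal equations by Cramer's rule.
    det = suu * svv - suv * suv
    alpha = (suy * svv - suv * svy) / det
    beta = (suu * svy - suv * suy) / det
    return alpha, beta

# Synthetic, noise-free data generated from the model (not measurements).
loads = [1, 2, 4, 8, 16, 32, 64, 128]
capacities = [usl(n, 0.18169, 0.00047) for n in loads]
alpha_hat, beta_hat = fit_usl(loads, capacities)
print(alpha_hat, beta_hat)
```

With real throughput data, C(N) would first be obtained by normalizing the measurements to X(1), and a nonlinear fit may weight the points differently from this linearized version.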

Theorem 1 (Universality). The necessary and sufficient
condition for the relative capacity C(N) to
be a universal scalability model is P(N) = N and
Q(N) = aN^2 + bN + c with coefficients a, b, c > 0.

Proof 1. The proof is best demonstrated by considering
latency rather than throughput. See [Gun08] for
details of this proof. □

Interestingly, the proof establishes a similarity between 
the USL and Brooks' law [Bro95] for the management 
of software projects, viz., "Adding more manpower to a 
late software project makes it later." In this case, N is
interpreted as people rather than processes or processors.
Brooks' law is the analog of the negative ROI mentioned 
earlier. 

Equation (3) clearly satisfies Theorem 1. Next, we show
that each of these parametric models has a more funda- 
mental meaning because they correspond to bounds on 
throughput for a well-defined queueing model. 



3 QUEUEING MODELS 



In Sects. 3.2 and 3.3 we develop some generalizations
of the MRM that do not appear in the queueing-theory 
literature. First, we briefly review the standard MRM. 



3.1 Standard Repairman 



The repairman queueing model [GH98] is shown
schematically in Fig. 3 and represents an assembly line
comprising a finite number of machines N which break
down after a mean lifetime Z. A repairman takes a
mean time S to repair each broken machine. If multiple
machines fail, the additional machines must queue
for service in FIFO order. The queue-theoretic notation
for the MRM, M/M/1//N, implies exponentially
distributed lifetimes and service periods with a finite
population N of requests and buffering.

Figure 3: Machine repairman schematic



In steady state, ZX machines are "up", while Q are
"down" for repairs, such that the total number of machines
in either state is given by N = Q + ZX. Rearranging
this expression we have:

$$Q = N - ZX \tag{6}$$

and applying Little's law Q = XR gives

$$XR = N - ZX \tag{7}$$

which produces

$$R(N) = \frac{N}{X(N)} - Z \tag{8}$$

for the mean residence time at the repair station.
Rearranging (8) provides an expression for the mean MRM
throughput as a function of N:

$$X(N) = \frac{N}{R(N) + Z} \tag{9}$$



Several solutions of X(N)/X(1) are shown in Fig. 4.
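
Curves of this kind can be computed without simulation. The sketch below uses the standard exact mean value analysis (MVA) recursion for a closed network with one queueing server and a think-time delay. This is a swapped-in textbook technique, not a method described in this paper. The function name `mrm_throughput` and the values S = 1, Z = 9 (a round-trip time of 10 service units) are assumptions chosen for illustration.

```python
def mrm_throughput(n, s, z):
    """Mean throughput X(N) of the standard M/M/1//N repairman model,
    computed with the exact MVA recursion over populations 1..n."""
    q = 0.0   # mean number at the repairman with one fewer machine
    x = 0.0
    for k in range(1, n + 1):
        r = s * (1.0 + q)    # residence time: own service plus queue ahead
        x = k / (r + z)      # throughput from the response-time law, as in (9)
        q = x * r            # Little's law at the repairman
    return x

S, Z = 1.0, 9.0              # illustrative values only
x1 = mrm_throughput(1, S, Z)
for n in (1, 2, 5, 10, 20, 50, 100):
    xn = mrm_throughput(n, S, Z)
    print(n, round(xn / x1, 3))   # relative capacity X(N)/X(1)
```

The relative capacity rises almost linearly at small N and then flattens toward (S + Z)/S, with no retrograde region.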

The MRM has its origins in operations research associated
with manufacturing systems [GH98]. However, because
all queueing models are only abstractions, the MRM has
found more widespread applications. The two most
important applications for our discussion are:

Time-share systems: N represents users making
requests to a central processing facility. An MRM
queueing model was applied to the performance
analysis of a research time-share computing system
called CTSS [Sch67], the precursor to UNIX.

Multiprocessors: N represents CPUs or cores and the
MRM represents the interconnect that allows the
CPUs to communicate and share data [AMBC90,
RF87].

Figure 4: MRM throughput curves with normalized saturation
values for round-trip times equal to 10, 500 and
1000 service units. No degradation is possible for any
choice of N, S or Z.

Table 2: Interpretation of the queueing metrics in Fig. 3,
where MRM: machine repairman, CMP: core multiprocessor
and TSS: time-share system

Metric   Model   Interpretation
N        MRM     machines
         CMP     processors, cores
         TSS     processes, users
Z        MRM     up time
         CMP     execution period
         TSS     think time
S        MRM     service time
         CMP     transmission time
         TSS     CPU time
R(N)     MRM     residence time
         CMP     interconnect latency
         TSS     queueing time
X(N)     MRM     failure rate
         CMP     bus bandwidth
         TSS     application throughput



A summary of how to interpret the MRM queueing variables
in each of the described cases is provided in
Table 2. The multiplicity of MRM interpretations justifies
the statement in Sect. 2 that these parametric models
can be applied to both hardware and software scalability
analysis. In particular, since the USL model (3) does not
presume any particular type of application or topology,
it can be applied equally well from multi-cores to
multi-tier systems. That detailed information is present in the
USL, but it is encoded in the two parameters α and β.

Having reviewed the standard MRM, we now develop
some generalizations that will be needed to make the
connection between the MRM and the USL explicit.
These generalizations are: (i) state-dependent service
times and (ii) synchronous queueing, which we treat in
Sects. 3.2 and 3.3, respectively.



3.2 Prepping Repairman 

Figure 4 shows that the mean throughput is approximately
linear for values of N near the origin (i.e., low
load) and reaches a saturation plateau at high loads when
N > (S + Z)/S. Retrograde throughput, of the type
exhibited in Fig. 1, is not possible in the standard MRM
for any choice of N, S or Z. However, we do know that
retrograde throughput is associated with load-dependent
servers in queueing models [see, e.g., Gun05, Chap. 10].

The question then becomes: What should be the form 
of the load-dependency such that the MRM produces 
exactly the retrograde throughput exhibited by the USL? 
In principle, such load-dependence could take any form. 
In Sects. 3.3 and 4.3 we show, rather surprisingly, that
simple linear load-dependence is required to produce USL
behavior in the MRM. 

Linear load-dependence in the context of the MRM
means that the repairman has to prepare up to N failed
machines in some way, e.g., rank them, prior to actually
servicing them. In general, such ranking will involve pairwise
comparisons and this introduces an additional delay
that grows binomially with N, i.e., $\binom{N}{2} = N(N - 1)/2$.
Up to a factor of 2, this is precisely the N(N - 1) term in the
denominator of (3). It is the queueing analog of the negative
ROI discussed in Sect. 2.4.

It is important to realize that this additional delay, due to
preparations, is suffered by all of the enqueued machines
before the repairman commences service. For brevity,
we shall hereafter refer to this kind of load-dependent 
repairman as the prepping repairman. 



3.3 Synchronous Queueing 

One more condition is necessary in order to establish the
connection between the parametric models of Sect. 2 and
the MRM, viz., synchronous queueing. The reason for
this requirement stems from the fact that the residence
time R(N) in (9) can be quite arbitrary and, mathematically
speaking, may not even possess an analytic form.
But even if R(N) does have an analytic form, it is
unlikely to be a polynomial in N and thus, will not produce
rational functions like (2).

If, however, all machines were to break down
simultaneously, the queue length at the repairman would
be maximized such that the residence time becomes
R(N) = NS, i.e., one machine in service and (N - 1)
waiting. Synchronous queueing produces worst-case
throughput and it therefore represents a lower
bound [ZSEG82] on (9):

$$\frac{N}{NS + Z} \leq X(N) \tag{10}$$

In the context of multiprocessor scalability (see Table 2),
it is tantamount to all N processors simultaneously
exchanging data or sending messages across the interconnect.

It is this synchronous queueing condition that causes
the throughput (9), in both the standard and
load-dependent MRM models, to conform to (2) and thus
provide the connection with the rational functions
discussed in Sect. 2.

Theorem 2 (Main theorem). The universal scalability
law (3) is equivalent to the synchronous bound
on relative capacity in the MRM with linear
load-dependent service rate.

Proof 2 (Sketch). Under synchronous queueing in the
MRM, when the first request is in service the mean
waiting time for the remainder is given by

$$W = (Q - 1)\,S \tag{11}$$

where Q is the mean number of requests in the system.
Now, let the service time be load-dependent such that

$$S(Q) = c\,Q\,S$$

with c a constant of proportionality. For synchronous
queueing Q = N, so we can rewrite (11) as:

$$W = c\,(Q - 1)\,Q\,S = c\,N (N - 1)\,S \tag{12}$$

Expressed as relative throughput, (12) appears
in the denominator of (3) as the N(N - 1) term. The
detailed proof appears in [Gun08]. □

If we consider synchronous queueing in the standard
MRM, i.e., without load-dependent service, we recover
Amdahl's law, which is also the special case of the USL with
β = 0 in (3).

Corollary 1. Amdahl's law is equivalent to the
relative throughput due to synchronous queueing in the
standard MRM with mean execution time Z and constant
service time S.

Proof 3. The proof requires the identity

$$\alpha = \frac{S}{S + Z} \tag{13}$$

See [Gun07, Appendix A] and [Gun08] for details. □
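
The algebra behind Corollary 1 can be worked through directly. The following is a sketch rather than the cited proof: it takes the synchronous throughput N/(NS + Z), normalizes it by its value at N = 1 as in the definition of relative capacity, and substitutes the identity α = S/(S + Z).

```latex
\begin{align*}
X(N) &= \frac{N}{NS + Z}, \qquad X(1) = \frac{1}{S + Z}, \\
C(N) &= \frac{X(N)}{X(1)} = \frac{N\,(S + Z)}{NS + Z}
      = \frac{N}{1 + \dfrac{S}{S + Z}\,(N - 1)}
      = \frac{N}{1 + \alpha\,(N - 1)} .
\end{align*}
```

The third equality uses NS + Z = (S + Z) + S(N - 1). The result has the form of Amdahl's law, and C(N) approaches (S + Z)/S = 1/α as N grows.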

Corollary 2. Gustafson's law (5) corresponds to the
rescaling Z ↦ NZ in the MRM.

Proof 4. See [Gun08] as well as the simulation results
in Sect. 4.4. □



The precise nature of the synchronization discussed here 
turns out to be rather subtle. To see this, consider the 
case where all machines have the same deterministic Z 
period. At the end of the first Z period, all N machines 
will enqueue at the MRM simultaneously. By definition, 
however, the machines are serviced serially, so they will 
return to the operational phase (parallel execution
in the top portion of Fig. 3) separately and thereafter
will always return to the repairman at different times. In 
other words, even if the queueing system is started with 
synchronized visits to the repairman, that synchroniza- 
tion is immediately lost after the first tour because it is 
destroyed by the very process of queueing. 

Unfortunately, the analytic equations used in this sec- 
tion only provide a steady-state view of the MRM, so we 
cannot discern the details of the stable synchronization 
process. Consequently, we turn to discrete-event simula- 
tion models [Hol04b] as an exploratory tool, rather than 
a predictive tool. As we shall see in Sect. 5, the synchronous
MRM simulations also provide deeper insight
into potential performance tuning opportunities in real 
systems. 



4 SIMULATION MODELS 



From a basic modeling standpoint, the discrete-event
simulations of these cases look much the same: a fixed
number of requests circulates through a closed system,
each with an associated wait time and service time.
The basic model is shown in Fig. 5.

There is a place where the requests are running in parallel 
on multiple CPUs, and another place where the requests 
are serialized on a single CPU. The cases differ in how 
requests make the transitions between the two modes of 
operation. 

The simulation models are intended to clarify how the
analytical results in Sects. 2 and 3 relate to the way
an actual application executes (cf. Fig. 1). They will
be used to demonstrate what features of an architecture
might drive an application to the various performance
regions or zones defined in Fig. 15. Some of us might be
mathematically challenged and prefer to see something
running on a real platform, on which we can measure
performance, that exhibits the characteristics defined by
the analytical equations. In lieu of a real platform, we use
simulation models. For each model, we define how the
parameters are used and how they relate to the real
world. Hopefully, this will provide the reader with a
better understanding of how each of these models is
representative of situations that they might have seen
on their own computer systems.

Figure 5: SIMUL8 model corresponding to Fig. 3

We have presented a number of mathematical equations 
that represent the expected throughput of various models 
of computer scalability. There are four models discussed 
in this section for which a discrete simulation queueing 
model will be created to show that there is a correlation 
between what the analytic models predict and the results 
of the simulation models (and thus real platforms). 

Discrete-event simulation can be defined as one in which "the
operation of a system is represented as a chronological
sequence of events. A common exercise in learning how to
build discrete-event simulations is to model a queue, such
as customers arriving at a bank to be served by a teller.
In this example, the system entities are CUSTOMER-QUEUE
and TELLERS." As mentioned in Sect. 1, this
approach has been used to create models for computer
performance evaluation. Simulation represents a real
system by modeling the important characteristics. For the
models in this paper, there are three objects that are
used in the modeling:

1. Resource to be consumed (CPU cycles in this 
case) 

2. Consumer of the resource (CPU actively working 
on a request) 

3. Queue to hold requests for the CPU if it is not 
available 

A word of clarification might be helpful in light of the
comment made in Sect. 1 regarding the common impasse
of needing to parameterize queueing models with
service times that are often not available in standard
performance measurements. We are not using simulation
models to make performance predictions in the usual
sense. The simulation models discussed in this paper are
constructed to explore the underlying dynamics of the
analytic scalability models in Sects. 2 and 3. In order
to reveal the connection between the simulation models
and the analytic models, it turns out that we only need
to define the ratios between the queueing variables N,
S and Z in Table 2. The actual numbers can be any
numeric values, e.g., S = 1.0 second. In this sense, we
are free to construct our simulation models because they
do not require measured service times.



4.1 Repairman Simulation 



The first simulation model we consider is the standard
MRM defined in Sect. 3.1. This model is a closed queueing
model and can be used to explain the performance
of a system where jobs can run in parallel (no contention
for the CPU resource), but every so often they enter a
serial activity where only one request can be processed
at a time (e.g., programs requesting a lock on a common
table). In this case multiple requests may queue up
waiting to progress through the serial portion. The other
characteristic of the MRM is that it is an
asynchronous model, in that the jobs may enter the serial
portion independently of one another.

This model (Fig. 5) was constructed using the SIMUL8
event-based simulator. It represents the flow of work in a
closed system. As soon as the serial portion is
completed, the request can start execution in parallel
with other requests. The "Serial" (S) and "Parallel" (P)
objects are consumers of the "CPU" resource, which in
this case has 100 CPUs available so that all 100 requests
can execute at the same time in the Parallel object. The
Serial object has the restriction that only one CPU can
be used at a time, forcing outstanding requests to be
queued and processed sequentially.

Besides the number of CPUs, the other important parameter
in the model is the percentage of CPU time spent
in serial execution as compared to parallel execution.
For all the models in this paper, 10% of the time is
serial operation. If the time to process a unit of work is
1 second, then 0.9 seconds are spent in the parallel path
and 0.1 seconds in the serial path. The bottleneck of the
system is the serial path, which limits the throughput to
10 jobs per second (1/0.1). In the model, the timings in
the Parallel and Serial objects are exponentially distributed
with means equal to the service times given above.

In the model, we run a series of configurations and
measure the throughput of the system. The percentage
in the serial path is fixed and the number of initial
requests is varied. The number of available CPUs is set
to the number of initial requests. The throughput is
normalized to that of a system with a single request.
Plotting this curve provides an indication of the effect
of adding more CPUs to handle the load as the number
of requests in the system also increases. Fig. 6 is the
result of the model using the percentages above.
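
The experiment just described can be approximated with a short event-driven sketch in Python. It is not the SIMUL8 model: the function name `simulate_mrm`, the seed and the list of populations are assumptions, while the 0.9 s parallel time, 0.1 s serial time, exponential timings and 3,000 s run length are taken from the text. Because one CPU is available per request, the parallel phase is modeled as a pure delay.

```python
import heapq
import random

def simulate_mrm(n, z, s, sim_time, seed=1):
    """Throughput of a closed system with n requests: a parallel phase
    (mean z, no contention) followed by one FIFO serial server (mean s).
    Both timings are exponentially distributed."""
    rng = random.Random(seed)
    # Each heap entry is the time at which a request reaches the serial queue.
    arrivals = [rng.expovariate(1.0 / z) for _ in range(n)]
    heapq.heapify(arrivals)
    server_free_at = 0.0
    completed = 0
    while True:
        arrive = heapq.heappop(arrivals)       # earliest arrival is served first (FIFO)
        start = max(arrive, server_free_at)    # wait while the serial server is busy
        finish = start + rng.expovariate(1.0 / s)
        if finish > sim_time:
            break
        server_free_at = finish
        completed += 1
        # The request returns to the parallel phase and comes back later.
        heapq.heappush(arrivals, finish + rng.expovariate(1.0 / z))
    return completed / sim_time

x1 = simulate_mrm(1, 0.9, 0.1, 3000.0)
for n in (1, 10, 20, 50, 100):
    xn = simulate_mrm(n, 0.9, 0.1, 3000.0)
    print(n, round(xn / x1, 2))    # throughput normalized to a single request
```

Processing arrivals in time order is sufficient here because every re-entry time is later than the service that produced it, so the heap order coincides with FIFO order at the serial server.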




Figure 6: Throughput of the conventional MRM

The x-axis presents the number of CPUs (or requests)
in the system. As you can see, we reach a normalized
throughput of 10 relatively quickly. Plotted on the secondary
y-axis on the right is the system utilization as
would be measured by the operating system. Notice that
when we reach 100 CPUs, the overall system utilization
is only 10%, or as indicated by the graph, only 10 CPUs
are doing useful work; the others are idle since the tasks
that were executing on them are in the queue waiting for
serial execution.

From Table 2 we know that the MRM is representative
of a multi-user (N), client/server system where asynchronous
requests are contending for a common resource.
To minimize loss of throughput, the serial portion must
be reduced or the work partitioned so the CPUs can be
used more effectively.

The model can provide these numbers directly in the case
of the MRM (e.g., an average of 89.7 requests in the
queue and a response time of 9.07 seconds for a request
to make it through the serial portion). In the other
models, however, we will not be able to measure these
values directly because of the way the model handles
the synchronization of the Parallel and Serial operations.
Instead, we can derive these values by keeping track of the
amount of CPU consumed in the Parallel and Serial
components.

For the case of the MRM with N = 100 requests, the
model was run for 3,000 seconds, and 27,206 CPU seconds
were consumed in the Parallel object and 2,999 in the Serial
object. There were 30,118 requests processed in this
time. We can determine the average number of
requests in each object by evaluating the number of CPU
sec/sec consumed, which indicates the number of CPUs
required to handle the requests:

$$N_p = \frac{27206\ \text{CPU sec}}{3000\ \text{sec}} = 9.07\ \text{CPU sec/sec} \quad \text{(Parallel)}$$

$$N_s = \frac{2999\ \text{CPU sec}}{3000\ \text{sec}} = 1.00\ \text{CPU sec/sec} \quad \text{(Serial)}$$

We have the number of requests in the Parallel and Serial
objects, so the remainder must be in the queue
for the Serial object:

$$N_q = 100 - 9.07 - 1.00 = 89.93$$

The response time can be computed using Little's law:

$$R = \frac{N_q + N_s}{X} = \frac{89.93 + 1.00}{30118/3000} \approx 9.06\ \text{sec}$$

where X is the throughput of the MRM.
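
The same bookkeeping can be written as a few lines of Python. The input numbers are the ones reported above for the N = 100 run; the variable names are assumptions introduced for this sketch.

```python
# Figures reported for the MRM run with N = 100 requests.
n_requests = 100
run_time = 3000.0        # simulated seconds
cpu_parallel = 27206.0   # CPU seconds consumed in the Parallel object
cpu_serial = 2999.0      # CPU seconds consumed in the Serial object
completed = 30118        # requests processed during the run

# CPU seconds per second equals the mean number of requests in each object.
n_parallel = cpu_parallel / run_time
n_serial = cpu_serial / run_time

# Whatever is neither running in parallel nor in service must be queued.
n_queue = n_requests - n_parallel - n_serial

# Little's law at the serial stage: R = (Nq + Ns) / X.
throughput = completed / run_time
response = (n_queue + n_serial) / throughput

print(round(n_parallel, 2), round(n_serial, 2))
print(round(n_queue, 2), round(response, 2))
```

Only CPU consumption and the completion count are needed, which is why the same derivation remains usable for the synchronized models where queue lengths cannot be read off directly.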



4.2 Synchronization Gate 

The difference between the MRM and Amdahl models
is the way that work is returned to the parallel/think
processing node. In the standard MRM, you can imagine
the repairman being presented with a bin of parts that
need to have service done on them before returning to
operation. The repairman will take a part out of the
"bin," do something to it for time S, and then return
the part to operation before picking out the next
one from the bin.

In the case of Amdahl, the repairman will still receive a
bin of parts to service, but instead of immediately returning
the part to operation after it has been serviced, the
repairman puts the part in an output bin, and only when
the input bin has been completely serviced are all the
parts returned to operation. As shown in Fig. 7, there
is effectively a "gate" that prevents the release of parts
until the repairman has repaired all of them.



Figure 7: MRM with synchronization gate (diagram labels:
Parallel Execution, Sync Gate, Serial Execution)



Only after all the jobs were in the right buffer, following
serial execution, were they released back to parallel
execution. But this had some problems, in that in the
real world jobs do not necessarily all stop at the same
time and request some serial operation. There were also
some limitations on the types of distributions that could
be used for the serial and parallel timings. To overcome
these limitations, consider a time-sharing system where,
if one of the jobs needs to run on the serial path, it will
stop or suspend all the jobs on the parallel path until it
completes. That is effectively what the gate is doing in
Fig. 7. The only impact on the parallel jobs is
that their elapsed time is increased by the time it
takes the serial job to execute. With this approach, jobs
can come out of the parallel path in a random fashion
and the overall effect is the same as using a gate.

The new model of the system therefore looks like Fig. [8]. As
soon as a job starts execution on the serial path, a
suspend signal is sent to the parallel portion (for example,
all the jobs could be put into a suspended mode on the next
clock tick). The serial job executes and, when it is
complete, the suspended jobs are restarted and the job
that completed the serial path is returned to execute in
the parallel path.



4.3 Amdahl Simulation 

Figure [8] is very similar to the MRM (Fig. [5]) with the 
exception that if there is a request running in the Serial 
object, the processing in the Parallel object will be sus- 
pended for the duration of the Serial execution. This is 
equivalent to all the requests trying to get at the same 
resource at the same time and each one is serialized and 
none can restart until all the serial requests have been 
handled. 

In the simulation model, when a request starts to operate
in the Serial object, a "breakdown" is generated for
the Parallel object. All work in the Parallel object is
suspended for the duration of execution on the Serial
object and then resumes where it left off.
This effectively extends the elapsed time in Parallel, but
the CPU time stays the same. By contrast, in the MRM
the elapsed time was the same as the CPU time in the
Parallel object.
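The suspend-on-serial behavior described here can be sketched outside SIMUL8 in a few lines of Python. The sketch below is not the SIMUL8 model itself: it replaces the "breakdown" mechanism with a separate parallel clock that stands still while the Serial object is busy, and all function names (`simulate_sync_mrm`, `exponential`, `amdahl_speedup`) are ours. The mean times of 0.9 and 0.1 seconds are assumed so as to give the 10% serial fraction used in the text.

```python
import heapq
import random


def simulate_sync_mrm(n_jobs, parallel_time, serial_time, completions):
    """Throughput of an MRM in which serial work suspends all parallel work.

    parallel_time, serial_time -- zero-argument callables, each returning
    one sampled duration in seconds (a hypothetical interface).
    """
    # Parallel work only advances while the serial server is idle, so keep
    # a "parallel clock" that is frozen during every serial service.
    parallel_clock = 0.0
    wall_clock = 0.0
    # Parallel-clock times at which each job next requests serial service.
    pending = [parallel_time() for _ in range(n_jobs)]
    heapq.heapify(pending)

    for _ in range(completions):
        finish = heapq.heappop(pending)
        wall_clock += finish - parallel_clock   # parallel phase runs
        parallel_clock = finish
        wall_clock += serial_time()             # all other jobs suspended
        # The serviced job re-enters the parallel node with fresh work.
        heapq.heappush(pending, parallel_clock + parallel_time())

    return completions / wall_clock


def exponential(mean):
    """Sampler for exponentially distributed times with the given mean."""
    return lambda: random.expovariate(1.0 / mean)


def amdahl_speedup(n, alpha):
    """Amdahl's law with serial fraction alpha."""
    return n / (1.0 + alpha * (n - 1))


if __name__ == "__main__":
    random.seed(1)
    z, s = 0.9, 0.1   # assumed mean parallel and serial times
    x1 = simulate_sync_mrm(1, exponential(z), exponential(s), 20000)
    for n in (1, 10, 50, 100):
        xn = simulate_sync_mrm(n, exponential(z), exponential(s), 20000)
        print(n, round(xn / x1, 2), round(amdahl_speedup(n, s / (s + z)), 2))
```

Because the samplers are passed in as callables, fixed or normally distributed times can be substituted without changing the simulation loop.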



Figure 8: Amdahl model in SIMUL8 (diagram labels: Parallel Execution, 100; Serial Execution; annotation: "Stop Parallel execution when Serial is running.")

In Fig. [9], the points represent the output from the model
and the green line represents the normalized throughput
(speedup) as predicted by Amdahl's law. The model,
which represents a "synchronous" MRM, correlates
exactly with the predicted results from Amdahl's law. The
limit in this case is also a speedup of 10.

The number of requests in each object, and the response time,
are calculated in the following manner because, in the model,
the number of requests in the queue averages zero. As soon as a
request is completed in the Parallel object, it is sent to
the Serial object where it immediately begins execution,
since a request can only complete in the Parallel object if
it is not suspended, which means there is no request being
processed on the Serial object.

Figure 9: Amdahl throughput

This model was run for 3000 seconds and had a throughput
of 27624 requests. The number of requests in each
of the objects and the response time are:

$$N_p = \frac{\text{Parallel CPU sec}}{3000\ \text{sec}} = 8.28\ \text{CPU sec/sec}$$

$$N_s = \frac{2750\ \text{CPU sec}}{3000\ \text{sec}} = 0.92\ \text{CPU sec/sec}$$

$$N_q = 100 - 8.28 - 0.92 = 90.80$$

$$R = \frac{90.80 + 0.92}{27624/3000} = 9.96\ \text{sec}$$



4.4 Gustafson Simulation 



The next simulation model represents Gustafson's parametric
equation discussed in Sect. 2.3. This simulation
model (Fig. 10) is similar to the Amdahl model in that
requests are running in parallel until they all need the
serial portion at the same time, but instead of each
request being processed sequentially through the serial
portion, all the requests are processed as a batch on a single
CPU with a constant service time. This might represent
locking on a common table that takes the same time to
update, no matter how many requests are waiting. In
the Amdahl model, each request must be processed
sequentially, so that when there are more requests, there
is an additional increase in the time spent in the serial
portion. For the Gustafson model, the time spent in the
serial portion is constant and is restricted to using a
single CPU.



Figure 10: Gustafson model in SIMUL8

Referring to Fig. 11, the speedup is linear (diagonal line).
The system utilization (curve) falls from 100% and
approaches an asymptote at 90% as N increases. As
mentioned in the last paragraph of Sect. 4, this follows from
the ratio choice of 10% for the serial fraction defined in
(13). The parallel timing is a fixed (constant) distribution,
so that all the requests ask for the serial portion at the
same time, and the serial timing is exponential.

Gustafson's law is a linear function; see the blue line in Fig. [2].
Fitting a linear regression model [Hol04a] of the form
$C(N) = mN + c$ produces $C(N) = 0.909N + 0.084$
with the coefficient of determination $R^2 = 1$,
indicating a nearly perfect fit to the simulation data. To within
experimental error, the gradient $m = 0.909$ represents
the gradient in Gustafson's law, i.e., $(1 - \alpha) = 0.90$,
since we chose $\alpha = 0.10$ in all our simulations. Similarly,
rounding the intercept value to one significant digit gives
$c \approx 0.1$, in agreement with $\alpha = 0.10$ as the intercept in
that law.
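The regression step can be illustrated with an ordinary least-squares fit written in plain Python. The sketch below does not use the simulation output, which is not reproduced here. Instead it fits synthetic points generated from Gustafson's law with $\alpha = 0.10$, so it recovers the ideal gradient 0.90 and intercept 0.10 rather than the measured 0.909 and 0.084. The function names and the list of loads are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, c, 1.0 - ss_res / ss_tot


def gustafson_speedup(n, alpha):
    """Gustafson's linearized scaling: intercept alpha, gradient (1 - alpha)."""
    return alpha + (1.0 - alpha) * n


# Synthetic stand-in for the simulation output (NOT the paper's data).
loads = [1, 10, 20, 40, 60, 80, 100]
speedups = [gustafson_speedup(n, 0.10) for n in loads]
m, c, r2 = fit_line(loads, speedups)
print(f"C(N) = {m:.3f} N + {c:.3f}, R^2 = {r2:.3f}")
```

With measured speedups in place of the synthetic list, the fitted gradient and intercept give independent estimates of $(1 - \alpha)$ and $\alpha$.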

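This model was run for 3000 seconds and had a throughput
of 299608 requests.

$$N_p = \frac{270033\ \text{CPU sec}}{3000\ \text{sec}} = 90.0\ \text{CPU sec/sec}$$

$$N_s = \frac{299\ \text{CPU sec}}{3000\ \text{sec}} = 0.10\ \text{CPU sec/sec}$$

$$N_q = 100 - 90.0 - 0.10 = 9.9$$

$$R = \frac{9.9 + 0.10}{299608/3000} = 0.10\ \text{sec}$$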



The response time remains the same as the service
time (0.1 seconds), which is what is expected with the
Gustafson model.

Figure 11: Gustafson throughput

4.5 USL Simulation

The next case is the USL model defined in Sect. 2.1.
This is similar to the Amdahl model, but with the
addition of a load-dependent serial server. What this means
is that as the number of requests waiting for serial
service increases, the amount of time that it takes the serial
portion to run also increases. This is equivalent to a
program that might read through the waiting queue to pick
the highest priority request to process. So each time a
message is retrieved from the queue, the entire queue is
searched. This might not take a lot of time, but as the
number of outstanding requests increases, so does the
serial processing time.

Figure 12: USL model in SIMUL8 (diagram labels: Parallel Execution, 100; Serial Load Dependent; annotation: "Parallel execution stops when the Serial starts executing.")

This model was run with 10% of the time spent in the
serial portion and an additional 0.1% increase in the serial
service time for every request that is waiting in the queue.

Figure 13: USL throughput

Figure [13] illustrates a retrograde speedup due to the
increase in processing time based on the number of
requests outstanding in the queue. The points on the graph
are the output from the simulation model. The green line
is a fit to the Universal Scalability equation. The legend
at the bottom of the graph is output from fitting the
data points to the equation. The results from the model
match the equation ($R^2 = 1$), indicating that Universal
Scalability is the equivalent of an Amdahl model, with
the addition of a load-dependent serial server.
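The retrograde behavior can also be read directly from the USL equation. The following sketch evaluates $C(N) = N / (1 + \alpha(N-1) + \beta N(N-1))$ and the load at which it peaks, $N^* = \sqrt{(1-\alpha)/\beta}$, which follows from setting $dC/dN = 0$. The value $\alpha = 0.10$ is the serial fraction used in the simulations; $\beta = 0.001$ is an assumed illustrative value, not a fitted parameter quoted from the figure legend. The function names are ours.

```python
import math


def usl_speedup(n, alpha, beta):
    """USL relative capacity: C(N) = N / (1 + alpha(N-1) + beta N(N-1))."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_load(alpha, beta):
    """Load that maximizes C(N), from dC/dN = 0: N* = sqrt((1 - alpha)/beta)."""
    return math.sqrt((1.0 - alpha) / beta)


alpha, beta = 0.10, 0.001   # beta is an assumed, illustrative value
n_star = usl_peak_load(alpha, beta)
print(f"peak near N = {n_star:.0f}, C = {usl_speedup(n_star, alpha, beta):.2f}")
for n in (10, 30, 60, 100):
    print(n, round(usl_speedup(n, alpha, beta), 2))
```

Setting `beta = 0` removes the load-dependent term, and the function reduces to Amdahl's law, consistent with the interpretation given above.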

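The model was run for 3000 seconds and had a throughput
of 14727 requests.

$$N_p = \frac{13245\ \text{CPU sec}}{3000\ \text{sec}} = 4.42\ \text{CPU sec/sec}$$

$$N_s = \frac{2867\ \text{CPU sec}}{3000\ \text{sec}} = 0.96\ \text{CPU sec/sec}$$

$$N_q = 100 - 4.42 - 0.96 = 94.62$$

$$R = \frac{94.62 + 0.96}{14727/3000} = 19.47\ \text{sec}$$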



4.6 Generalized Distributions 

For the simulations in Sect. 4, exponential distributions
were used in all cases since they are the basis for solving
most analytical equations (see Sect. 3.1). But we can
show with the simulation models that any distributions
can be chosen for the service times, thereby extending
the applicability of the USL.

Figure [14] shows the USL model run with a deterministic
Parallel service time (FIXED 0.9) and a normally distributed
Serial service time (NORMAL(0.1, 0.02)),
i.e., a mean of 0.1 and a standard deviation of 0.02. The
result is very similar to Fig. [13].



Figure 14: USL model with generalized distributions (legend: fitted equation CPU / (1 + α(CPU − 1) + β × CPU × (CPU − 1)), with α = 0.1)



5 SCALABILITY ZONES 



The simulation results presented in the previous section
not only verify theorem [2] for the USL, but they also
provide a deeper insight into the nature of the "Three
C's" in Sect. 2.1. In fact, we could reinterpret them as
follows:

1. Concurrency-limited scalability ($\alpha, \beta = 0$) corresponds
to asynchronous queueing at the repairman,
which is the same as the mean value solution (9).

2. Contention-limited scalability ($\alpha > 0$, $\beta = 0$) corresponds
to synchronous queueing at the repairman.

3. Coherency-limited scalability ($\alpha, \beta > 0$) corresponds
to synchronous queueing at a prepping repairman.



The particular meaning ascribed to the word "repairman"
can be decided upon using Table [2]. Moreover, as
described in Sect. 2.1, each of the C's is associated with a
term in the denominator of the USL equation and, taken
separately, each of them corresponds to a distinct
scalability curve: (1) concurrent linearity, (2) synchronous
contention (Amdahl's law), and (3) synchronous
contention with load-dependent service. These curves are
shown as dashed lines in Fig. [15]. The usual convention
is to focus on only one of these possible curves to assess
scalability. We propose, instead, to consider the three
regions between these curves as defining three scalability
zones:


Zone A: Linear scalability zone associated with asyn- 
chronous requests. 

Zone B: Amdahl-limited scalability zone associated 
with synchronous requests. 

Zone C: Coherency-limited scalability associated with 
synchronous requests and exchange-dependent ser- 
vice. 
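As a rough illustration of how a measured data point might be assigned to one of these zones, consider the sketch below. It rests on an assumption about the boundaries that the text does not spell out in full: Zone A lies between concurrent linearity and Amdahl's law, Zone B between Amdahl's law and the USL curve, and Zone C on or below the USL curve. The function name `classify_zone` and the parameter values in the usage lines are hypothetical.

```python
def classify_zone(n, speedup, alpha, beta):
    """Return 'A', 'B' or 'C' for a measured relative capacity at load n.

    Assumed boundaries (an interpretation, not a definition from the text):
      A -- above the Amdahl curve (bounded above by linear scaling)
      B -- between the Amdahl curve and the USL curve
      C -- on or below the USL curve
    """
    amdahl = n / (1.0 + alpha * (n - 1))
    usl = n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))
    if speedup > amdahl:
        return "A"
    if speedup > usl:
        return "B"
    return "C"


# Hypothetical measurements: (load, relative capacity).
for n, c in [(10, 8.0), (50, 7.0), (100, 4.0)]:
    print(n, c, classify_zone(n, c, alpha=0.10, beta=0.001))
```

Applied point by point to a throughput curve normalized by its single-user value, such a classification shows at which loads an application crosses from one zone into another.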

Just as water is only one of three possible phases (viz.,
ice, water, steam) and exists only in a particular temperature
range (32°F < T < 212°F), so an application can exist
in any one of the three performance phases or zones A,
B or C, for a given range of user-loads (N). Similarly,
just as water undergoes a phase transition to steam (i.e.,
boiling) with increasing temperature (T > 212°F), so an
application can transition between zones as a function of
increasing load.

Figure 15: Scalability zones. Unnormalized throughput
data X(N) from Fig. [1] overlaid on scalability zones
denoted A, B and C consistent with Table [1].

Figure [15] presents a case in point. At low load, the
WebSphere data lie in Zone A, which is bounded by
concurrent linearity on the upper side and synchronous
contention on the lower side. The data are closer to the
lower bound for N < 20. The new interpretation of these
data is that synchronous queueing appears to dominate
scalability at low loads. At N = 20, just like ice melting,
the behavior of WebSphere changes rather dramatically
as it starts to transition across Zone B and onto the
upper side of Zone C. For N > 40, WebSphere oscillates
along the boundary between Zones C and B.

We know from our extended MRM model that Zone C
means load-dependent service is superimposed on top
of synchronous messaging in Zone B. This could be
occurring for many reasons. Some examples are: priority
sorting of the message queue, garbage collection or
undesirable memory leaks. We cannot provide a definite
explanation without a more thorough investigation of the
architecture, but that is the job of the software architect
or the software engineer, not the performance analyst.

However, we can help the software architect or
engineer by directing their attention to the $O(N^2)$
performance degradation in the USL model and explaining that
Zone C involves synchronous requests with
exchange-dependent service. In this way, the zones interpretation
can quickly narrow the range of potential causes,
something that would otherwise be very difficult to do. As
implausible as this may seem to the casual reader, it is
our experience that hints of this type are more than
sufficient to trigger "Eureka!" moments among those most
knowledgeable about the architectural details.

Another practical insight that emerges from our zones
view is also worth noting. The zones in Fig. [15]
suggest one way to improve throughput performance, viz.,
attempt to replace the synchronous messaging, evident
in Zones B and C, with asynchronous messaging. This
strategy is analogous to the well-known performance
gains that can be achieved by replacing synchronous
(blocking) I/O with asynchronous (non-blocking) I/O.
(See wiki/Asynchronous_I/O.)

Clearly, asynchronous messaging should make throughput
scale almost linearly (Gustafson's dream of Sects. 2.3
and 4.4) but only for very low loads. Beyond $N \approx 10$
users, the throughput reaches saturation and (8) tells us
that user response-time will begin to climb up the
proverbial "hockey stick" handle. However, such linearity may
not be desirable from a performance management
perspective. It may be preferable to reach saturation more
slowly and accommodate more aggregate users. Since
this is tantamount to keeping on the upper side of Zone
B, it is only necessary to ensure that the prepping
repairman effect be minimized in the application. It is not
necessary to eliminate synchronous messaging.



6 CONCLUSION

In this paper, we have used event-based simulation as an
exploratory tool to accomplish several things. Simulation
has confirmed the USL parametric modeling equation as
being physical in the sense that it corresponds to the
synchronous bound on throughput in a particular queueing
model: a prepping machine repairman (Sect. 4.5). This
result is the generalization of an earlier theorem
concerning a queueing interpretation of Amdahl's law based on
rational functions [Gun02].

By virtue of our approach, we have shown that Amdahl
and Gustafson scaling laws are also unified by the same
queueing model, viz., the machine-repairman model.
Moreover, corollary [1] is a lower bound on throughput,
namely synchronous throughput, and therefore represents
worst-case scalability. With this physical interpretation, it
follows immediately that Amdahl's law can be "defeated"
more conveniently than proposed in [Nel96] by simply
requiring that all requests be issued asynchronously.

To understand the USL in terms of the machine
repairman, the standard queueing model had to be extended
to include: (i) synchronous queueing (Sect. 3.3) and (ii)
state-dependent service (Sect. 3.2). The precise nature
of the synchronous queueing was only revealed by
simulation, because the analytic equations used in the proof
of theorem [2] are steady-state equations. Consequently,
they hide the details of how the synchronization occurs,
as well as obscuring how it controls the possible
statistical distributions of the S and Z times in Table [2].

The simulation models provide a more intuitive
understanding of how all these effects combine in a
non-mathematical way. They also reveal how "real world"
applications might behave (Table [1]) and how this behavior
is reflected in the parametric models used for statistical
regression (Table [1]). It is this concrete physical
interpretation of the USL regression parameters that makes it a
more practical tool than the traditional queueing-model
approach for assessing application scalability.

Finally, our investigations have led us to abandon the
usual goal of fitting any particular nonlinear scalability
model to data. Rather, we treat the data as dynamic
and thus capable of making transitions between
scalability zones (Sect. [5]) as a function of load. Each of these
zones comes with a well-defined interpretation in terms
of queueing effects and this can be vital for system
architects and performance engineers when considering how
to get into a better scalability zone.